Abstract

We extend a relaxation technique due to Bertsimas and Nino-Mora for the restless 
bandit problem to the case where arbitrary costs penalize switching between the ban- 
dits. We also construct a one-step lookahead policy using the solution of the relaxation. 
Computational experiments and a bound for approximate dynamic programming pro- 
vide some empirical support for the heuristic. 

1 Introduction 

We study the restless bandit problem (RBP) with general switching costs between the ban- 
dits, which could represent travel distances for example. This problem is an intractable 
extension of the multi-armed bandit problem (MABP), which can be described as follows. 
There are $N$ projects, of which only one can be worked on at any time period. Project $i$
is characterized at (discrete) time $t$ by its state $x_i(t)$, which belongs to a finite state space
$S_i$. If project $i$ is worked on at time $t$, one receives a reward $\alpha^t r(x_i(t))$, where $\alpha \in (0, 1)$ is
a discount factor. The state $x_i(t)$ then evolves to a new state according to given transition
probabilities. The states of all idle projects are unaffected. The goal is to find a policy which 
decides at each time period which project to work on in order to maximize the expected sum 
of the discounted rewards over an infinite horizon. The MABP problem was first solved 
by Gittins [5]. He showed that it is possible to define separately for each project an index 
which is a function of the project state only, and that the optimal policy operates at each 
period the project with the greatest current index. Moreover, these indices can be calculated 
efficiently, as shown for example in [19]. 

Whittle [20] proposed an interesting modification of the model, called the restless bandit 
problem (RBP), which extends significantly the range of applications. In the RBP, one 
can activate several projects at each time period, and the projects that are not activated 
continue to evolve, possibly using different transition probabilities. Finding an optimal policy 
efficiently for the RBP is unlikely to be possible however, since the problem is PSPACE-hard 
[14], even in restricted cases. Nevertheless, Whittle proposed an index policy for the RBP 
which performs well in practice. 

Another extension of the MABP concerns the addition of costs for changing the currently 
active project. This problem, which we call the multi-armed bandit problem with switching 
costs (MABPSC), is of great interest to various applications, as discussed by [8], [9], [18], [10], 
in order to model for example set-up and tear-down costs in queuing networks, transition 
costs in a job search problem or transaction fees in a portfolio optimization problem. It is easy 
to see that the MABPSC is NP-hard, since the HAMILTON CYCLE problem is a special 
case of it [12]. The MABPSC has been studied in particular by Asawa and Teneketzis [1], 
and very recently by Glazebrook et al. [6] and Nino-Mora [13]. These authors are concerned 
with the case where the switching costs have a separable form $c_{ij} = c_i + c_j$, preserving the
separable structure from the MABP, and design approximate index policies. 

Our work was motivated by an optimal aerial surveillance problem, where switching costs 
correspond to travel distances between inspection sites. Hence, the assumption on the sepa- 
rable form of the switching costs does not hold. This introduces additional coupling between 
the projects, and it is not clear then how to design index policies. Moreover, the sites con- 
tinue to evolve while not visited, and thus we are led to consider the restless bandit problem 
with switching costs (RBPSC). 



We adopt a computational approach to the RBPSC. We impose no restriction on the switch- 
ing costs, not even the triangle inequality. In Section 2, we formulate the problem as a Markov 
decision process (MDP), using the state-action frequency approach [4]. This yields a linear 
program, which we relax in Section 3 by following an idea that Bertsimas and Nino-Mora
developed for the RBP [3], optimizing over a restricted set of marginals of the occupation
measure. The coupling introduced by the switching costs makes this relaxation significantly 
more challenging to develop than in the classical case, and the first contribution of the paper 
is to present valid constraints on the marginals improving the quality of the relaxation. This 
relaxation provides an efficiently computable bound on the achievable performance. Section 
4 describes how the relaxation can also be used to motivate a heuristic policy. This heuristic 
is based on approximate dynamic programming (ADP) techniques, but we also show how to 
recover it from the linear programming theory used by Bertsimas and Nino-Mora to design 
their primal-dual heuristic for the RBP. Section 5 presents numerical experiments comparing 
the heuristic to the performance bound. 

The advantage of using the approximate dynamic programming point of view is that a re- 
cently developed performance bound provides additional support for our heuristic. However, 
we do not consider in this paper the development of policies with an a priori performance
bound. Few results exist in the literature concerning such bounds. As remarked by Guha et 
al. [7], even the standard RBP is PSPACE-Hard to approximate to any non-trivial factor, 
unless some assumptions are made on the reward functions. 

2 Exact Formulation of the RBPSC Problem

We formulate the RBPSC using the linear programming approach to Markov decision
processes [4], [16]. $N$ projects are distributed in space at $N$ sites, and $M \le N$ servers can
be allocated to $M$ different projects at each time period $t = 1, 2, \ldots$. In the following, we
use the terms project and site interchangeably; likewise, agent and server have the same
meaning. At each time period, each server must occupy one site, and different servers must
occupy distinct sites. We say that a site is active at time $t$ if it is visited by a server, and
is passive otherwise. If a server travels from site $k$ to site $l$, we incur a cost $c_{kl}$. Each site
can be in one of a finite number of states $x_n \in S_n$, for $n = 1, \ldots, N$, and we denote the
Cartesian product of the individual state spaces by $\mathcal{S} = S_1 \times \ldots \times S_N$. If site $n$ in state $x_n$
is visited, a reward $r^1_n(x_n)$ is earned, and its state changes to $y_n$ according to the transition
probabilities $p^1_{x_n y_n}$. If the site is not visited, then a reward (potentially negative) $r^0_n(x_n)$ is
earned for that site and its state changes according to the transition probabilities $p^0_{x_n y_n}$. We
assume that all sites change their states independently of each other.

Let us denote the set $\{1, \ldots, N\}$ by $[N]$. We consider that when no agent is present at a given
site, there is a fictitious agent called passive agent at that site. We also call the real agents
active agents, since they collect active rewards. The transition of a passive agent between
sites does not involve any switching cost, and when a passive agent is present at a site, the
passive reward is earned. Therefore, we have a total of $N$ agents including both the real and
passive agents, and we can describe the positions of all agents by a vector $s = (s_1, \ldots, s_N)$,
which corresponds to a permutation of $[N]$. We denote the set of these permutation vectors
by $\Pi_{[N]}$, and we let the $M$ first components correspond to the active agents. For example,
with $M = 2$ and $N = 4$, the vector $(s_1 = 2, s_2 = 3, s_3 = 1, s_4 = 4) \in \Pi_{[4]}$ means that agent 1
is in site 2, agent 2 in site 3 and sites 1 and 4 are passive.
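The position-vector convention can be made concrete with a few lines of Python. This is only an illustrative sketch: the helper names `position_vectors` and `active_and_passive_sites` are hypothetical and do not come from the paper.

```python
from itertools import permutations

def position_vectors(n_sites):
    """All position vectors s, where s[i] is the (1-based) site of agent i+1."""
    return list(permutations(range(1, n_sites + 1)))

def active_and_passive_sites(s, n_active):
    """Split the sites of a position vector into active and passive sets."""
    active = set(s[:n_active])    # first M components: real (active) agents
    passive = set(s[n_active:])   # remaining components: fictitious passive agents
    return active, passive

s = (2, 3, 1, 4)                  # the example from the text, with M = 2 and N = 4
active, passive = active_and_passive_sites(s, n_active=2)
print("active sites:", sorted(active), "passive sites:", sorted(passive))
print("number of position vectors for N = 4:", len(position_vectors(4)))
```

The number of position vectors grows as $N!$, which is one source of the size of the exact formulation.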

For an agent $i \in [N]$, we refer to the set of the other agents by $-i$. If we fix $s_i \in [N]$ for some
$1 \le i \le N$, then we write $s_{-i}$ to denote the vector $(s_1, \ldots, s_{i-1}, s_{i+1}, \ldots, s_N)$, and $\Pi_{[N]-s_i}$
to denote the permutations of the set $[N] - \{s_i\}$. In particular, we write $\sum_{s_{-i} \in \Pi_{[N]-s_i}}$ to
denote the sum over all permutations of the positions of the agents $-i$, over the set of sites
not occupied by agent $i$. We also write $S_{-i}$ to denote the Cartesian product $S_1 \times \ldots \times S_{i-1} \times
S_{i+1} \times \ldots \times S_N$.

The state of the system at time $t$ can be described by the state of each site and the positions
$s \in \Pi_{[N]}$ of the servers, including the passive ones. With this state description, we are able to
handle any number $M \le N$ of agents as a parameter within the same framework. We denote
the complete state by $(x_1, \ldots, x_N; s_1, \ldots, s_N) := (x; s)$. We can choose which sites are to be
visited next, i.e., an action $a$ belongs to the set $\Pi_{[N]}$ and corresponds to the assignment of
the agents, including the passive ones, to the sites for the next time period. Once the sites
to be visited are chosen, there are costs $c_{s_i a_i}$ for moving the active agent $i$ from site $s_i$ to
site $a_i$, including possibly a nonzero cost for staying at the same site. The immediate reward
earned is

$$R((x;s),a) := \sum_{i=1}^{M} \left( r^1_{a_i}(x_{a_i}) - c_{s_i a_i} \right) + \sum_{i=M+1}^{N} r^0_{a_i}(x_{a_i}).$$
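As a minimal sketch of how the immediate reward is evaluated, the function below sums the active rewards net of switching costs over the first $M$ agents and the passive rewards over the remaining ones. Sites, agents and states are 0-based here, and the function name, data layout and numbers are all hypothetical choices made for illustration.

```python
def immediate_reward(x, s, a, n_active, r_active, r_passive, cost):
    """Immediate reward R((x; s), a), with 0-based sites and agents.

    x[n]            : current state of site n
    s[i], a[i]      : current site and next site of agent i
    r_active[n][k]  : active reward of site n in state k
    r_passive[n][k] : passive reward of site n in state k
    cost[k][l]      : switching cost from site k to site l
    """
    total = 0.0
    for i, (origin, dest) in enumerate(zip(s, a)):
        if i < n_active:   # real agent: active reward minus travel cost
            total += r_active[dest][x[dest]] - cost[origin][dest]
        else:              # fictitious passive agent: passive reward only
            total += r_passive[dest][x[dest]]
    return total

# Hypothetical data: 3 sites with 2 states each, one active agent.
r_active = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
r_passive = [[0.0, -1.0], [0.0, -1.0], [0.0, -1.0]]
cost = [[0.0, 2.0, 3.0], [2.0, 0.0, 1.0], [3.0, 1.0, 0.0]]
reward = immediate_reward(x=(1, 1, 0), s=(0, 1, 2), a=(1, 0, 2), n_active=1,
                          r_active=r_active, r_passive=r_passive, cost=cost)
print(reward)
```

In this example the active agent moves from site 0 to site 1 (reward 4, cost 2), while site 0 becomes passive in state 1 (reward -1) and site 2 stays passive in state 0 (reward 0).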

We are given a distribution $\nu$ on the initial state of the system, and we will assume a product
form

$$\nu(x;s) = \prod_{i=1}^{N} \nu_i(x_i)\, \delta_{d_i}(s_i), \tag{1}$$

i.e., the initial states of the sites are independent random variables and server $i$ leaves initially
from site $d_i$, with $d \in \Pi_{[N]}$.

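The transition probability matrix has a particular structure, since the sites evolve
independently and the transitions of the agents are deterministic. Let us write its elements,
i.e., the probabilities of moving from $(x;s)$ to $(x';s')$ under action $a$, as
$\mathcal{P}_{(x;s)a(x';s')} = P_{xax'} \prod_{i=1}^{N} \delta_{a_i}(s'_i)$, where

$$P_{xax'} = \prod_{i=1}^{M} p^1_{x_{a_i} x'_{a_i}} \prod_{i=M+1}^{N} p^0_{x_{a_i} x'_{a_i}}.$$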

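The optimal infinite horizon discounted reward, multiplied by $(1 - \alpha)$, is the optimal value
of the following linear program (LP) [4]

$$\text{maximize} \quad \sum_{s \in \Pi_{[N]}} \sum_{a \in \Pi_{[N]}} \sum_{x \in \mathcal{S}} R((x;s),a)\, \rho_{(x;s),a} \tag{2}$$

subject to

$$\sum_{s', a \in \Pi_{[N]}} \sum_{x' \in \mathcal{S}} \rho_{(x';s'),a} \left[ \delta_{(x;s)}(x';s') - \alpha \mathcal{P}_{(x';s')a(x;s)} \right] = (1-\alpha)\, \nu(x;s), \quad \forall (x,s) \in \mathcal{S} \times \Pi_{[N]}, \tag{3}$$

$$\rho_{(x;s),a} \ge 0, \quad \forall ((x;s),a) \in \mathcal{S} \times \Pi^2_{[N]}.$$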

The variables $\{\rho_{(x;s),a}\}$ of the LP, called state-action frequencies or occupation measure, form
a probability measure on the space of state-action pairs, and an optimal policy can be
recovered from an optimal solution of the LP. The formulation above is of little computational
interest, however, since the number of variables and constraints is of the order of $|\mathcal{S}| \times (N!)^2$,
that is, exponential in the size of the input.
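To give a feel for this growth, the short sketch below counts the variables of the exact LP, one per triple $(x, s, a)$, and its equality constraints, one per pair $(x, s)$. The function names are hypothetical; the two instance sizes are chosen to match the dimensions $(N, |S_i|)$ of two of the problems in Table 1.

```python
from math import factorial, prod

def exact_lp_variables(state_counts):
    """Number of variables rho_{(x;s),a}: |S| * (N!)^2."""
    n_sites = len(state_counts)
    return prod(state_counts) * factorial(n_sites) ** 2

def exact_lp_equalities(state_counts):
    """Number of equality constraints (3): one per state (x; s), i.e. |S| * N!."""
    n_sites = len(state_counts)
    return prod(state_counts) * factorial(n_sites)

for n_sites, n_states in [(4, 3), (6, 4)]:
    sizes = [n_states] * n_sites
    print(n_sites, n_states, exact_lp_variables(sizes), exact_lp_equalities(sizes))
```

Going from 4 sites with 3 states to 6 sites with 4 states multiplies the number of variables by more than four orders of magnitude.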

We can obtain the linear program dual to (2) by constructing it directly, or starting from
Bellman's equation and using standard dynamic programming arguments [2, vol. 2, p. 53].
The decision variables $\{\lambda_{x,s}\}_{x,s}$ of the dual correspond to the reward-to-go vector. We get

$$\text{minimize} \quad (1-\alpha) \sum_{x,s} \lambda_{x,s}\, \nu(x;s) \tag{4}$$

$$\text{s.t.} \quad \lambda_{x,s} - \alpha \sum_{x' \in \mathcal{S}} P_{xax'}\, \lambda_{x',a} \ge R((x;s),a), \quad \forall x \in \mathcal{S},\ \forall (s,a) \in \Pi^2_{[N]}.$$

3 LP Relaxation of the RBPSC 

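We compute a bound on the performance achievable by any assignment policy by relaxing
the LP formulation. We start by rewriting the objective function (2) and we identify the
relevant marginals of the occupation measure:

$$\sum_{s=1}^{N} \sum_{a=1}^{N} \sum_{x_a \in S_a} \left[ \left( r^1_a(x_a) - c_{sa} \right) \left( \sum_{i=1}^{M} \rho^i_{(x_a;s),a} \right) + r^0_a(x_a) \left( \sum_{i=M+1}^{N} \rho^i_{(x_a;s),a} \right) \right],$$

where the marginals appearing above are obtained as follows (with $s_i = s$ and $a_i = a$):

$$\rho^i_{(x_a;s),a} = \sum_{x_{-a} \in S_{-a}} \sum_{s_{-i} \in \Pi_{[N]-s}} \sum_{a_{-i} \in \Pi_{[N]-a}} \rho_{(x;s),a}, \quad \forall (i,s,a) \in [N]^3,\ x_a \in S_a, \tag{5}$$

and the superscripts refer to the agents.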

Now to express the constraints, we will also need the following variables:

$$\rho^i_{(x_s;s),a} = \sum_{x_{-s} \in S_{-s}} \sum_{s_{-i} \in \Pi_{[N]-s}} \sum_{a_{-i} \in \Pi_{[N]-a}} \rho_{(x;s),a}, \quad \forall (i,s,a) \in [N]^3,\ x_s \in S_s. \tag{6}$$

The variables in (5) (respectively (6)) can be interpreted as the frequency with which agent
$i$ switches from site $s$ to site $a$ and the destination (resp. origin) site is in state $x_a$ (resp. $x_s$).
Note that this notation is somewhat redundant, since we can write the variables $\rho^i_{(x_j;j),j}$ as
in (5) or (6).

It is straightforward to see that the constraints (3) imply

$$\sum_{a=1}^{N} \rho^i_{(x_s;s),a} - \alpha \sum_{x'_s \in S_s} \sum_{s'=1}^{N} \rho^i_{(x'_s;s'),s}\, p^{1\{i \le M\}}_{x'_s x_s} = (1-\alpha)\, \nu_s(x_s)\, \delta_{d_i}(s), \quad \forall (i,s) \in [N]^2,\ \forall x_s \in S_s, \tag{7}$$

on the marginals, where $1\{\cdot\}$ is the indicator function. However, there are additional
relations that must exist because the marginals are obtained from the same original occupation
measure. These relations must be found to ensure a sufficiently strong relaxation. Another
intuitive way to think about this type of constraints is that we enforce sample path
constraints only on average [20]. First, from the definitions we have immediately:

$$\sum_{x_s \in S_s} \rho^i_{(x_s;s),a} = \sum_{x_a \in S_a} \rho^i_{(x_a;s),a}, \quad \forall (i,s,a) \in [N]^3. \tag{8}$$

Now for the RBPSC, exactly one agent (active or passive) must be at each site at each
period. The frequency with which the agents leave site $j$ in state $x_j$ should be equal to the
frequency with which the agents move to site $j$ in state $x_j$. So we expect that the following
constraints should hold:

$$\sum_{i=1}^{N} \sum_{a=1}^{N} \rho^i_{(x_j;j),a} = \sum_{i=1}^{N} \sum_{s=1}^{N} \rho^i_{(x_j;s),j}, \quad \forall j \in [N],\ x_j \in S_j. \tag{9}$$

We now show that (9) are indeed valid constraints. We use the notation $(x_{-j}, x_j)$ to mean
that the $j$-th component of the vector is $x_j$, and similarly for $(s_{-i}, j)$. We have, starting from
the definition (6),

$$\sum_{i=1}^{N} \sum_{a=1}^{N} \rho^i_{(x_j;j),a} = \sum_{x_{-j} \in S_{-j}} \sum_{a \in \Pi_{[N]}} \sum_{i=1}^{N} \sum_{s_{-i} \in \Pi_{[N]-j}} \rho_{((x_{-j},x_j);(s_{-i},j)),a} = \sum_{x_{-j} \in S_{-j}} \sum_{a \in \Pi_{[N]}} \sum_{s \in \Pi_{[N]}} \rho_{((x_{-j},x_j);s),a}. \tag{10}$$

The first equality comes from the fact that we count all the permutation vectors $a$ by varying
first the $i$-th component $a_i$ from 1 to $N$. The second equality comes from the fact that we
count all the permutations $s$ by varying the position $i$ where the component $s_i$ is equal to $j$
(exactly one of the components of a permutation vector of $\Pi_{[N]}$ has to be $j$). The proof that
the right hand side of (9) is also equal to the quantity in (10) is identical.

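Here are two additional sets of valid constraints:

$$\sum_{s=1}^{N} \sum_{x_s \in S_s} \sum_{a' \in [N]-\{a\}} \rho^i_{(x_s;s),a'} = \sum_{k \in [N]-\{i\}} \sum_{s=1}^{N} \sum_{x_s \in S_s} \rho^k_{(x_s;s),a}, \quad \forall (i,a) \in [N]^2, \tag{11}$$

$$\sum_{a=1}^{N} \sum_{x_a \in S_a} \sum_{s' \in [N]-\{s\}} \rho^i_{(x_a;s'),a} = \sum_{k \in [N]-\{i\}} \sum_{a=1}^{N} \sum_{x_a \in S_a} \rho^k_{(x_a;s),a}, \quad \forall (i,s) \in [N]^2. \tag{12}$$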



Intuitively, on the left hand side we have the probability that agent $i$ does not go to site $a$
(respectively does not leave from site $s$), which must equal the probability that some other
agent $k$ (passive or not) goes to site $a$ (respectively leaves from site $s$). Again, these relations
can be verified by inspection of (6). Indeed, in (11) for $(i, a)$ fixed, similarly to (9), we have
two equivalent ways of summing the occupation measure over all indices $(x, s, a)$ such that
none of the permutation vectors $a$ with $a$ in position $i$ appears. On the left hand side of
(11), we vary the component $a_i$ in the set $\{1, \ldots, N\} \setminus \{a\}$, whereas on the right hand side,
we obtain the same result by forcing the element $a$ to be in a position different from position
$i$. Similarly, in (12), we have two ways of summing over all indices such that none of the
permutation vectors $s$ with $s$ in position $i$ appears.
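These counting arguments can be checked numerically on a tiny instance. The sketch below draws an arbitrary probability measure on the triples $(x, s, a)$, computes both families of marginals, and verifies (8), (9), (11) and (12). It is only a sanity check under stated assumptions: sites and agents are 0-based, the instance size and variable names are hypothetical, and (11) is checked on the origin-state marginals (6) while (12) is checked on the destination-state marginals (5).

```python
import itertools
import random
from collections import defaultdict

N, STATES = 3, 2                               # hypothetical tiny instance
sites = range(N)
perms = list(itertools.permutations(sites))    # position vectors, 0-based
xs_all = list(itertools.product(range(STATES), repeat=N))

# Arbitrary probability measure on (x, s, a); the identities hold for any of them.
rng = random.Random(0)
rho = {(x, s, a): rng.random() for x in xs_all for s in perms for a in perms}
total = sum(rho.values())
rho = {key: value / total for key, value in rho.items()}

rho_o = defaultdict(float)   # marginals (6): (i, s, a, state of origin site s)
rho_d = defaultdict(float)   # marginals (5): (i, s, a, state of destination site a)
for (x, s, a), mass in rho.items():
    for i in sites:
        rho_o[i, s[i], a[i], x[s[i]]] += mass
        rho_d[i, s[i], a[i], x[a[i]]] += mass

def close(u, v):
    return abs(u - v) < 1e-9

def check_8():
    return all(close(sum(rho_o[i, s, a, z] for z in range(STATES)),
                     sum(rho_d[i, s, a, z] for z in range(STATES)))
               for i in sites for s in sites for a in sites)

def check_9():
    return all(close(sum(rho_o[i, j, a, z] for i in sites for a in sites),
                     sum(rho_d[i, s, j, z] for i in sites for s in sites))
               for j in sites for z in range(STATES))

def check_11():
    def lhs(i, a):   # agent i does not go to site a
        return sum(rho_o[i, s, b, z] for s in sites for b in sites if b != a
                   for z in range(STATES))
    def rhs(i, a):   # some other agent k goes to site a
        return sum(rho_o[k, s, a, z] for k in sites if k != i for s in sites
                   for z in range(STATES))
    return all(close(lhs(i, a), rhs(i, a)) for i in sites for a in sites)

def check_12():
    def lhs(i, s):   # agent i does not leave from site s
        return sum(rho_d[i, t, a, z] for t in sites if t != s for a in sites
                   for z in range(STATES))
    def rhs(i, s):   # some other agent k leaves from site s
        return sum(rho_d[k, s, a, z] for k in sites if k != i for a in sites
                   for z in range(STATES))
    return all(close(lhs(i, s), rhs(i, s)) for i in sites for s in sites)

print(check_8(), check_9(), check_11(), check_12())
```

The check says nothing about how tight the relaxation is; it only confirms that the marginals of any measure over permutation pairs satisfy these equalities.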

Finally we have obtained a relaxation for the RBPSC:

Theorem 3.1. We can obtain an upper bound on the optimal reward achievable in the
RBPSC by solving the following linear program:

$$\text{maximize} \quad \sum_{s=1}^{N} \sum_{a=1}^{N} \sum_{x_a \in S_a} \left[ \left( r^1_a(x_a) - c_{sa} \right) \sum_{i=1}^{M} \rho^i_{(x_a;s),a} + r^0_a(x_a) \sum_{i=M+1}^{N} \rho^i_{(x_a;s),a} \right] \tag{13}$$

subject to (7), (8), (9), (11), (12),

$$\rho^i_{(x_s;s),a} \ge 0, \quad \forall (i,s,a) \in [N]^3,\ x_s \in S_s,$$

$$\rho^i_{(x_a;s),a} \ge 0, \quad \forall (i,s,a) \in [N]^3,\ x_a \in S_a.$$

There are now $O(N^3 \times \max_i |S_i|)$ variables $\rho^i_{(x_s;s),a}$, $\rho^i_{(x_a;s),a}$ and constraints in the relaxed
linear program, which is polynomial in the size of the input. From the remarks about the
complexity of the problem, it is unlikely that a polynomial number of variables will suffice to
formulate the RBPSC exactly. However, the addition of the constraints tying the marginals
together helps reduce the size of the feasible region spanned by the decision vectors and
improve the quality of the relaxation. Computing the optimal value of this linear program
can be done in polynomial time, and provides an upper bound on the performance achievable
by any policy for the original problem.



3.1 Dual of the Relaxation 



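It will be useful to consider the dual of the LP relaxation obtained in the previous paragraph,
which we derive directly from (13). This dual program could also be obtained by dynamic
programming arguments, in the spirit of the original work of Whittle, incorporating the
constraints (8), (9), (11), (12) using Lagrange multipliers. We obtain:

$$\text{minimize} \quad (1-\alpha) \sum_{i=1}^{N} \sum_{s=1}^{N} \sum_{x_s \in S_s} \nu_s(x_s)\, \delta_{d_i}(s)\, \lambda^i_{s,x_s} \tag{14}$$

subject to

$$\lambda^i_{s,x_s} + \mu^i_{s,a} + \kappa_{s,x_s} - \sum_{i' \ne i} \zeta^{i'}_a + \sum_{a' \ne a} \zeta^i_{a'} \ge 0, \quad \forall (i,s,a) \in [N]^3,\ s \ne a,\ x_s \in S_s, \tag{15}$$

$$-\alpha \sum_{x'_a \in S_a} p^{1\{i \le M\}}_{x_a x'_a} \lambda^i_{a,x'_a} - \mu^i_{s,a} - \kappa_{a,x_a} - \sum_{i' \ne i} \xi^{i'}_s + \sum_{s' \ne s} \xi^i_{s'} \ge r^{1\{i \le M\}}_a(x_a) - c_{sa} 1\{i \le M\}, \quad \forall (i,s,a) \in [N]^3,\ s \ne a,\ x_a \in S_a, \tag{16}$$

$$\lambda^i_{s,x_s} - \alpha \sum_{x'_s \in S_s} p^{1\{i \le M\}}_{x_s x'_s} \lambda^i_{s,x'_s} - \sum_{i' \ne i} \left( \zeta^{i'}_s + \xi^{i'}_s \right) + \sum_{s' \ne s} \left( \zeta^i_{s'} + \xi^i_{s'} \right) \ge r^{1\{i \le M\}}_s(x_s) - c_{ss} 1\{i \le M\}, \quad \forall (i,s) \in [N]^2,\ x_s \in S_s. \tag{17}$$

The optimal dual variables $\bar\lambda^i_{s,x_s}$ are Lagrange multipliers corresponding to the constraints
(7) and have a natural interpretation in terms of reward-to-go if site $s$ is in state $x_s$ and
visited by agent $i$. The optimal dual variables $\bar\mu^i_{s,a}$, $\bar\kappa_{a,x_a}$, $\bar\zeta^i_a$, $\bar\xi^i_s$ correspond to the additional
constraints (8), (9), (11), and (12) respectively. We can obtain the optimal primal and dual
variables simultaneously when solving the relaxation. For $j = s$ or $a$, we also obtain the
optimal reduced costs $\bar\gamma^i_{(x_j;s),a}$. The $\bar\gamma^i_{(x_s;s),a}$ are equal to the left hand side of the constraints (15),
whereas the $\bar\gamma^i_{(x_a;s),a}$ are equal to the difference between the left hand side and the right hand side
of (16), or (17) if $s = a$. There is one such reduced cost for each variable $\rho^i_{(x_j;s),a}$ of the primal,
and by complementary slackness, $\bar\rho^i_{(x_j;s),a}\, \bar\gamma^i_{(x_j;s),a} = 0$, where $\{\bar\rho^i_{(x_j;s),a}\}$ is the optimal solution of
the primal.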



4 A Heuristic for the RBPSC 



The relaxation is also useful to actually design assignment policies for the agents. We
present here a one-step lookahead policy and its relationship with the primal-dual heuristic
of Bertsimas and Nino-Mora, developed for the RBP.



4.1 One-Step Lookahead Policy

Consider the multi-agent system in state $(x;s)$, with $s$ a permutation of $[N]$. Given the
interpretation of the dual variables $\bar\lambda^i_{s,x_s}$ in terms of reward-to-go mentioned in Section 3.1,
it is natural to try to form an approximation $J(x;s)$ of the global reward-to-go in state $(x;s)$
as

$$J(x_1, \ldots, x_N; s_1, \ldots, s_N) = \sum_{i=1}^{N} \bar\lambda^i_{s_i,x_{s_i}}, \tag{18}$$

where $\bar\lambda^i_{s_i,x_{s_i}}$ are the optimal values of the dual variables obtained when solving the LP
relaxation. The separable form of this approximate cost function is useful to design an
easily computable one-step lookahead policy [2], as follows. In state $(x;s)$, we obtain the
assignment $u(x;s)$ of the agents as

$$u(x;s) \in \arg\max_{a \in \Pi_{[N]}} \left\{ R((x;s),a) + \alpha \sum_{x' \in \mathcal{S}} P_{xax'}\, J(x';a) \right\}. \tag{19}$$



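In this computation, we replaced the true optimal cost function, which would provide an
optimal policy, by the approximation $J$. Using (18), we can rewrite the maximization above
as

$$\max_{a \in \Pi_{[N]}} \sum_{i=1}^{N} m_{i,a_i}, \tag{20}$$

with

$$m_{i,a_i} := r^{1\{i \le M\}}_{a_i}(x_{a_i}) - c_{s_i a_i} 1\{i \le M\} + \alpha \sum_{x'_{a_i} \in S_{a_i}} p^{1\{i \le M\}}_{x_{a_i} x'_{a_i}}\, \bar\lambda^i_{a_i,x'_{a_i}}.$$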



Assuming that the optimal dual variables have been stored in memory, the evaluation of the
terms $m_{i,a_i}$, for all $(i, a_i)$, takes a time $O(N^2 \max_i |S_i|)$. The maximization (20) is then a
linear assignment problem, which can be solved by linear programming or in time $O(N^3)$
by the Hungarian method [17]. Thus, the assignment can be computed at each time step in
time $O(N^2 \max_i |S_i| + N^3)$ by a centralized controller.
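The two steps of the on-line computation, building the score matrix and solving the assignment problem, can be sketched as follows. This is a minimal illustration under stated assumptions: indices are 0-based, the function names and all numerical data are hypothetical, the array `lam[i][a][y]` stands in for dual values that would come from solving the relaxation, and the assignment is found by brute-force enumeration (practical only for very small $N$) rather than by the Hungarian method.

```python
from itertools import permutations

def lookahead_scores(x, s, n_active, alpha, r_active, r_passive, cost,
                     p_active, p_passive, lam):
    """Score m[i][a] of sending agent i to site a, in the spirit of (20)."""
    n = len(s)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        is_active = i < n_active
        for a in range(n):
            reward = r_active[a][x[a]] if is_active else r_passive[a][x[a]]
            row = p_active[a][x[a]] if is_active else p_passive[a][x[a]]
            future = sum(p * lam[i][a][y] for y, p in enumerate(row))
            switch = cost[s[i]][a] if is_active else 0.0
            m[i][a] = reward - switch + alpha * future
    return m

def best_assignment(m):
    """Brute-force solution of the linear assignment problem max_a sum_i m[i][a_i]."""
    n = len(m)
    return max(permutations(range(n)),
               key=lambda a: sum(m[i][a[i]] for i in range(n)))

# Hypothetical instance: 3 sites, 2 states each, 1 active agent, static site states.
x, s = (0, 0, 0), (0, 1, 2)
r_active = [[1.0, 0.0], [5.0, 0.0], [2.0, 0.0]]
r_passive = [[0.0, 0.0]] * 3
cost = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
p_active = p_passive = [identity] * 3

zero_lam = [[[0.0, 0.0] for _ in range(3)] for _ in range(3)]
greedy = best_assignment(lookahead_scores(x, s, 1, 0.9, r_active, r_passive,
                                          cost, p_active, p_passive, zero_lam))

lam = [[[0.0, 0.0] for _ in range(3)] for _ in range(3)]
lam[0][2] = [10.0, 10.0]      # made-up reward-to-go favouring site 2 for agent 0
lookahead = best_assignment(lookahead_scores(x, s, 1, 0.9, r_active, r_passive,
                                             cost, p_active, p_passive, lam))
print(greedy, lookahead)
```

With all dual values set to zero the scores reduce to immediate rewards net of switching costs, so the active agent goes to the site with the best immediate payoff; a large reward-to-go at another site changes the choice. For larger $N$ one would replace the enumeration by a polynomial-time assignment solver.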



4.2 Equivalence with the Primal-Dual Heuristic 



Recall from paragraph 3.1 that, when solving the linear programming relaxation, we can
obtain the optimal primal variables $\{\bar\rho^i_{(x_j;s),a}\}$, the dual variables $\{\bar\lambda^i_{s,x_s}, \bar\mu^i_{s,a}, \bar\kappa_{a,x_a}, \bar\zeta^i_a, \bar\xi^i_s\}$, and
the reduced costs $\{\bar\gamma^i_{(x_j;s),a}\}$. These reduced costs are nonnegative. Bertsimas and Nino-Mora
[3] motivated their primal-dual heuristic for the RBP using the following well-known
interpretation of the reduced costs: starting from an optimal solution, $\bar\gamma^i_{(x_j;s),a}$ is the rate
of decrease in the objective value of the primal linear program (13) per unit increase in the
value of the variable $\rho^i_{(x_j;s),a}$.

We use this interpretation and the following intuitive idea: when agent $i$ is in site $s$ in state
$x_s$ and we decide to send it to site $a$ in state $x_a$, in some sense we are increasing the values of
$\rho^i_{(x_s;s),a}$ and $\rho^i_{(x_a;s),a}$, which are the long-term probabilities of such transitions. In particular,
we would like to keep the quantities $\rho^i_{(x_j;s),a}$ found to be $0$ in the relaxation as close to $0$ as
possible in the final solution. By complementary slackness it is only for these variables that
we might have $\bar\gamma^i_{(x_j;s),a} > 0$. Hence, when the system is in state $(x;s)$, we associate to each
action $a$ an index of undesirability

$$I((x;s),a) = \sum_{\{i \in [N] : s_i \ne a_i\}} \left( \bar\gamma^i_{(x_{s_i};s_i),a_i} + \bar\gamma^i_{(x_{a_i};s_i),a_i} \right) + \sum_{\{i \in [N] : s_i = a_i\}} \bar\gamma^i_{(x_{s_i};s_i),s_i}, \tag{21}$$

that is, we sum the reduced costs for the $N$ different projects. Then we select an action
$a_{pd} \in \Pi_{[N]}$ that minimizes these indices:

$$a_{pd}(x;s) \in \arg\min_a \{ I((x;s),a) \}. \tag{22}$$

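We now show that this policy is in fact equivalent to the one-step lookahead policy described
earlier. Using the expression for the reduced costs from paragraph 3.1, we can rewrite the
indices in (21) more explicitly. The term $\bar\gamma^i_{(x_{s_i};s_i),a_i} + \bar\gamma^i_{(x_{a_i};s_i),a_i}$ of the sum (21) is equal to

$$\begin{aligned}
& \bar\lambda^i_{s_i,x_{s_i}} + \bar\kappa_{s_i,x_{s_i}} - \bar\kappa_{a_i,x_{a_i}} \\
& - \sum_{i'=1}^{N} \bar\xi^{i'}_{s_i} + \sum_{s'=1}^{N} \bar\xi^i_{s'} - \sum_{i'=1}^{N} \bar\zeta^{i'}_{a_i} + \sum_{a'=1}^{N} \bar\zeta^i_{a'} \\
& - \alpha \sum_{x'_{a_i} \in S_{a_i}} p^{1\{i \le M\}}_{x_{a_i} x'_{a_i}} \bar\lambda^i_{a_i,x'_{a_i}} - r^{1\{i \le M\}}_{a_i}(x_{a_i}) + c_{s_i a_i} 1\{i \le M\},
\end{aligned} \tag{23}$$

after cancellation of $\bar\mu^i_{s_i,a_i}$, and adding and subtracting $\bar\zeta^i_{a_i}$ and $\bar\xi^i_{s_i}$. This expression is valid
for the terms $\bar\gamma^i_{(x_{s_i};s_i),s_i}$ as well. Now after summation over $i \in [N]$, the first two lines in
expression (23) do not play any role in the minimization (22). This is obvious for the terms
that do not depend on $a$. For the terms involving the $\bar\zeta^{i'}_{a_i}$, we can write

$$\sum_{i=1}^{N} \sum_{i'=1}^{N} \bar\zeta^{i'}_{a_i} = \sum_{i'=1}^{N} \sum_{i=1}^{N} \bar\zeta^{i'}_{a_i} = \sum_{i'=1}^{N} \sum_{a=1}^{N} \bar\zeta^{i'}_{a},$$

the last equality being true since $a$ is just a permutation of $\{1, \ldots, N\}$. Hence the sums
involving the $\bar\zeta^{i'}_{a_i}$ cancel (in fact we even see that each individual sum is independent of the
choice of $a$). As for the term $\sum_{i=1}^{N} \bar\kappa_{a_i,x_{a_i}}$, it is equal to $\sum_{j=1}^{N} \bar\kappa_{j,x_j}$ and so it is independent
of the choice of $a \in \Pi_{[N]}$. We are left with the following optimization problem:

$$a_{pd}(x;s) \in \arg\min_a \left\{ - \sum_{i=1}^{N} \left( r^{1\{i \le M\}}_{a_i}(x_{a_i}) - c_{s_i a_i} 1\{i \le M\} + \alpha \sum_{x'_{a_i} \in S_{a_i}} p^{1\{i \le M\}}_{x_{a_i} x'_{a_i}} \bar\lambda^i_{a_i,x'_{a_i}} \right) \right\},$$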

which after a sign change is seen to be exactly (20). We have shown the following 

Theorem 4.1. The primal-dual heuristic (22), based on the interpretation of the reduced 
costs of the LP relaxation, is equivalent to the one-step lookahead policy (20) assuming the 
separable approximation (18) for the reward-to-go. 

In view of this result, we obtain an alternative way to compute the one-step lookahead
policy. The minimization (22) is again a linear assignment problem. If we can store the
$O(N^3 \max_i |S_i|)$ optimal reduced costs instead of the $O(N^2 \max_i |S_i|)$ optimal dual
variables, there is just a linear cost involved in computing the indices $I((x;s),a)$ of the problem
at each period, resulting in an overall $O(N^3)$ computational cost for the on-line optimization
at each period.




5 Numerical Experiments 



Table 1 presents numerical experiments on problems whose characteristics affect the
performance of the heuristic described in Section 4 in different ways. Linear programs are implemented in
AMPL and solved using CPLEX. Due to the size of the state space, the expected discounted
reward of the heuristics is computed using Monte-Carlo simulations. The computation of
each trajectory is terminated after a sufficiently large, but finite horizon: in our case, when
$\alpha^t$ times the maximal absolute value of any immediate reward becomes less than $10^{-6}$. To
reduce the amount of computation in the evaluation of the policies, we assume that the
distribution of the initial states of the sites is deterministic.
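The stopping rule for each simulated trajectory amounts to finding the first period $t$ at which $\alpha^t$ times the largest absolute immediate reward falls below the tolerance. A small sketch of that computation is given below; the function name and the example reward magnitudes are hypothetical.

```python
import math

def truncation_horizon(alpha, max_abs_reward, tol=1e-6):
    """Smallest t >= 0 such that alpha**t * max_abs_reward < tol."""
    if max_abs_reward < tol:
        return 0
    t = max(0, math.ceil(math.log(tol / max_abs_reward) / math.log(alpha)))
    # Correct for floating-point rounding on either side of the boundary.
    while alpha ** t * max_abs_reward >= tol:
        t += 1
    while t > 0 and alpha ** (t - 1) * max_abs_reward < tol:
        t -= 1
    return t

for alpha in (0.5, 0.9, 0.99):
    print(alpha, truncation_horizon(alpha, max_abs_reward=10.0))
```

The horizon grows roughly like $1/(1-\alpha)$ times a logarithmic factor, so trajectories for discount factors close to 1 are considerably longer to simulate.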

In a given problem, the number $|S_i|$ of states is chosen to be the same for all projects. $c/r$ is
the ratio of the average switching cost to the average active reward. This is intended
to give an idea of the importance of the switching costs in the particular experiment. The
switching costs are always taken to be nonnegative. $Z^*$ is the optimal value of the problem,
computed using (2), when possible. $Z_r$ is the optimal value of the relaxation and so provides
an upper bound on the achievable performance. $Z_{osl}$ is the estimated expected value of the
one-step lookahead policy. $Z_g$ is the estimated expected value of the greedy policy, which is
obtained by fixing the value of the $\bar\lambda^i_{s,x_s}$ in (20) to zero, i.e., approximating the
reward-to-go by zero. This greedy policy is actually optimal for the MABP with deteriorating active
rewards, i.e., such that projects become less profitable as they are worked on [2, vol. 2, p. 69].
Problem 2 is of this type and shows that the one-step lookahead policy does not perform
optimally in general.

Problem 1 is a MABP. The heuristic is not optimal, so we see that we do not recover
Gittins' policy. Hence the heuristic is also different from Whittle's in general, which reduces
to Gittins' in the MABP case. In Problem 3, we add transition costs to Problem 2. The greedy
policy is not optimal any more, and the one-step lookahead policy performs better in this
case. Problem 4 is designed to make the greedy policy underperform: two remote sites have
slightly larger initial rewards (taking into account the cost for reaching them), but the active
rewards at these sites are rapidly decreasing and the agents are overall better off avoiding
these sites. The greedy policy does not take into account the future transition costs incurred
when leaving these sites. In this case, it turns out that the one-step lookahead policy is
quasi-optimal. Problems 7 and 8 are larger scale problems, with up to 30 sites. The relaxation is
computed in about 20 minutes on a standard desktop, showing the feasibility of the approach
for this range of parameters.
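Since $Z_r$ is an upper bound on the optimal value, the relative difference between $Z_r$ and $Z_{osl}$ bounds the suboptimality of the one-step lookahead policy from above. The sketch below computes this gap for three rows of Table 1 at $\alpha = 0.9$; the helper name `relative_gap` is hypothetical, and the simulated values carry Monte-Carlo error that is not accounted for here.

```python
def relative_gap(upper_bound, achieved):
    """Relative gap between a performance bound and an achieved value."""
    return (upper_bound - achieved) / upper_bound

# (Z_r, Z_osl) pairs copied from Table 1 for alpha = 0.9.
rows = {
    "Problem 4": (767.2, 767.0),
    "Problem 6": (62.80, 47.0),
    "Problem 8": (2833.0, 2641.0),
}
gaps = {name: relative_gap(z_r, z_osl) for name, (z_r, z_osl) in rows.items()}
for name, gap in gaps.items():
    print(f"{name}: gap of at most {100 * gap:.2f}% of the bound")
```

A large gap, as for Problem 6, does not by itself say whether the heuristic or the relaxation is responsible, because the optimal value $Z^*$ is not available for that instance.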

6 A "Performance" Bound 

In this section, we present a result that offers some insight into why we could expect the 
one-step lookahead policy to perform well if the linear programming relaxation of the original 
problem is sufficiently tight. We begin with the following lemma.



Table 1: Numerical Experiments ("-" marks values of $Z^*$ that are not reported)

Problem     (N, M, |S_i|, c/r)   $\alpha$   $Z^*$     $Z_r$     $Z_g$    $Z_{osl}$

Problem 1   (4, 1, 3, 0)         0.5        84.69     85.21     84.5     84.3
                                 0.9        299.6     301.4     276      294
                                 0.99       2614.2    2614      2324     2611

Problem 2   (4, 1, 3, 0)         0.5        84.13     85.14     84.1     84.1
                                 0.9        231.0     245.1     231      228
                                 0.99       1337      1339      1337     1336

Problem 3   (4, 1, 3, 0.6)       0.5        57.54     59.32     56.0     57.3
                                 0.9        184.5     185.0     177      183
                                 0.99       1279      1280      1273     1277

Problem 4   (4, 2, 5, 1.39)      0.5        -         165.7     115      165
                                 0.9        -         767.2     661      767
                                 0.95       -         1518      1403     1518

Problem 5   (6, 2, 4, 0)         0.5        -         39.25     38.5     36.5
                                 0.9        -         214.0     205      198
                                 0.95       -         431.6     414      396

Problem 6   (6, 2, 4, 1.51)      0.5        -         9.727     6.93     8.24
                                 0.9        -         62.80     38.0     47.0
                                 0.95       -         128.7     78.0     99.0

Problem 7   (20, 15, 3, 1.16)    0.5        -         196.5     189      194
                                 0.9        -         952.7     877      900
                                 0.95       -         1899      1747     1776

Problem 8   (30, 15, 2, 2.18)    0.5        -         589.4     566      564
                                 0.9        -         2833      2640     2641
                                 0.95       -         5642      5218     5246



Lemma 6.1. The approximate reward-to-go (18) is a feasible solution for the original dual
linear program (4).

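Proof. Consider one constraint in the original dual LP (4), for a fixed state-action tuple
$(x, s, a)$. We consider a situation where $s_i \ne a_i$ for all $i \in \{1, \ldots, N\}$. Summing the
constraints (15) over $i$ for the given values of $x_{s_i}$, $s_i$, $a_i$, we get

$$\sum_{i=1}^{N} \bar\lambda^i_{s_i,x_{s_i}} + \sum_{i=1}^{N} \bar\mu^i_{s_i,a_i} + \sum_{i=1}^{N} \bar\kappa_{s_i,x_{s_i}} - \sum_{i=1}^{N} \sum_{i'=1}^{N} \bar\zeta^{i'}_{a_i} + \sum_{i=1}^{N} \sum_{a'=1}^{N} \bar\zeta^i_{a'} = \sum_{i=1}^{N} \bar\lambda^i_{s_i,x_{s_i}} + \sum_{i=1}^{N} \bar\mu^i_{s_i,a_i} + \sum_{i=1}^{N} \bar\kappa_{s_i,x_{s_i}} \ge 0.$$

The cancellation follows from the discussion preceding Theorem 4.1. Now summing the
constraints (16) over $i$, we also get

$$-\alpha \sum_{i=1}^{N} \sum_{x'_{a_i} \in S_{a_i}} p^{1\{i \le M\}}_{x_{a_i} x'_{a_i}} \bar\lambda^i_{a_i,x'_{a_i}} - \sum_{i=1}^{N} \bar\mu^i_{s_i,a_i} - \sum_{i=1}^{N} \bar\kappa_{a_i,x_{a_i}} \ge \sum_{i=1}^{N} \left( r^{1\{i \le M\}}_{a_i}(x_{a_i}) - c_{s_i a_i} 1\{i \le M\} \right).$$

Finally, we add these two inequalities. Since $s$ and $a$ are both permutations of $[N]$, the sums
of the $\bar\kappa$ terms cancel, and we obtain

$$\sum_{i=1}^{N} \bar\lambda^i_{s_i,x_{s_i}} - \alpha \sum_{i=1}^{N} \sum_{x'_{a_i} \in S_{a_i}} p^{1\{i \le M\}}_{x_{a_i} x'_{a_i}} \bar\lambda^i_{a_i,x'_{a_i}} \ge \sum_{i=1}^{N} \left( r^{1\{i \le M\}}_{a_i}(x_{a_i}) - c_{s_i a_i} 1\{i \le M\} \right),$$

which is the inequality obtained by using the vector (18) in the constraints of (4).

The case where $s_i = a_i$ for some $i$ is almost identical, considering the constraints (17) for
the corresponding indices. □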

In the following theorem, the occupation measure $F_\alpha(\nu, u)$ is a vector of size $|\mathcal{S}|$, representing
the discounted infinite horizon frequencies of the states under policy $u$ and initial distribution
$\nu$ [4]. The proof of the theorem follows from the analysis presented in [15]; see [11] for more
details.

Theorem 6.2. Let $\nu$ be an initial distribution on the states, of the product form (1). Let
$J^*$ be the optimal reward function, $J$ be an approximation of this reward function which is
feasible for the LP (4), and $u$ be the associated one-step lookahead policy. Let $F_\alpha(\nu, u)$ and
$J_u$ be the occupation measure vector and the expected reward associated to the policy $u$. Then

$$\nu^T (J^* - J_u) \le \frac{1}{1-\alpha} F_\alpha(\nu, u)^T (J - J^*). \tag{24}$$

From Lemma 6.1, the theorem is true in particular for $J$ formed according to (18). In words,
it says that starting with a distribution $\nu$ over the states, the difference in expected rewards
between the optimal policy and the one-step lookahead policy is bounded by a weighted
$\ell_1$-distance between the estimate $J$ used in the design of the policy and the optimal value
function $J^*$. The weights are given by the occupation measure of the one-step lookahead
policy. It provides some motivation to obtain a good approximation $J$, i.e., a tight relaxation,
which was an important element of this paper.